Statistical versus knowledge-based machine translation
Abstract
over, and what kinds of symbol systems should we create for them? Every time a new phenomenon is identified as a bottleneck or as problematic, the very acts of describing the phenomenon, defining it, and creating a set of symbols to represent its abstractions are symbolic (in both senses of the word!). The benefits: decreased learning time and more powerful rules, hence improved translation output quality.
The picture is inverted on the symbolic/linguistic side. Here the system is designed to use a great deal of knowledge about lexical features, grammatical word classes, and even perhaps semantic knowledge, as in the case of Pangloss. But this knowledge must be built into the system. Lexicons of words and rules of grammar, acquired by human labor, are expensive to compile and slow to accrue. Where a statistical system can sift through thousands of bilingual word correspondences an hour, a human cannot build more than a handful of detailed lexical items or grammar rules in that time. Even if the human's work is perfect and complete, the fact that one needs at least 120,000 words to cover a significant portion of a language such as English or Spanish means that it takes years for a group of lexicogrammarians to develop an adequate MT system. Generally, in the real world, the oldest systems are still the best.
But ARPA had only three years' funding for MT. And the ARPA MT program became increasingly ambitious, from initially calling for high-quality translations in only a limited domain (necessitating a small but detailed lexicon), to ultimately requiring the systems to handle unrestricted newspaper text.
Over the four years of the program, ARPA held four formal evaluations, which used various scales to compare translations produced by research systems, several commercial systems, and human experts.
Pressure increased on Pangloss, the symbolic/linguistic system, to expand its lexicon and grammar dramatically. The only way to respond was to automate: decrease the amount of information for each lexical item (because this usually requires human analysis), and acquire the lexical items and grammar patterns by machine. This step immediately introduced statistics-like processing into Pangloss.
Until they mature, symbolic systems thus respond mainly to a drive toward coverage and robustness. Especially in the face of increasingly challenging evaluations, symbolic system researchers begin to develop general rules to avoid catastrophic failure whenever the system encounters input for which specific rules have not yet been built. Such general rules usually provide not only the correct output for any input but a list of possible outputs for a general class of inputs. These outputs, which are correct at a certain level of generality, are filtered to select the best alternative(s). But what filter? When the task/evaluations prohibit human intervention, the filter must be automatic, and thus requires reliability indicators. By the twin moves of computing reliability numbers and extracting information from resources (semi-)automatically, symbolic system builders take their inevitable steps toward statistics. The benefits: a greatly expanded lexicon and more grammatical coverage, hence translation in larger domains.
Once begun, the process of hybridization continued for Candide and Pangloss (Lingstat and Japangloss, a sibling of Pangloss, were hybrids from the outset).
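The automatic filter described above, in which general rules propose several candidate outputs and a reliability number decides among them, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the `Candidate` type, the `select_best` function, the example strings, and the scores are hypothetical and are not taken from Pangloss or any other system named in the text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """One possible output proposed by a general rule (hypothetical type)."""
    text: str
    reliability: float  # assumed score in [0, 1]; higher means more trusted


def select_best(candidates, keep=1, threshold=0.0):
    """Return up to `keep` candidates scoring at least `threshold`, best first."""
    viable = [c for c in candidates if c.reliability >= threshold]
    return sorted(viable, key=lambda c: c.reliability, reverse=True)[:keep]


# Invented candidates and scores, standing in for the outputs of general rules.
candidates = [
    Candidate("the bank of the river", 0.62),
    Candidate("the river bank", 0.81),
    Candidate("the shore of the river", 0.47),
]

best = select_best(candidates, keep=2, threshold=0.5)
print([c.text for c in best])
```

The point of the sketch is only that no human is in the loop: once each alternative carries a number, selection reduces to thresholding and sorting, which is the sense in which such a filter pulls a symbolic system toward statistics.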
Journal title:
- IEEE Expert
Volume 11
Publication date 1996